Most of the code examples written down below, taken from this website https://r4ds.had.co.nz/data-visualisation.html
You only need to install a package once, but you need to reload it every time you start a new session.
#install.packages("tidyverse") to install package
library(tidyverse)
A data frame is a rectangular collection of variables (in the columns) and observations (in the rows).
We will use mgs dataset which contains contains a subset of the fuel economy data and only models which had a new release every year between 1999 and 2008.
mpg
## # A tibble: 234 x 11
## manufacturer model displ year cyl trans drv cty hwy fl class
## <chr> <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
## 1 audi a4 1.8 1999 4 auto(l… f 18 29 p comp…
## 2 audi a4 1.8 1999 4 manual… f 21 29 p comp…
## 3 audi a4 2 2008 4 manual… f 20 31 p comp…
## 4 audi a4 2 2008 4 auto(a… f 21 30 p comp…
## 5 audi a4 2.8 1999 6 auto(l… f 16 26 p comp…
## 6 audi a4 2.8 1999 6 manual… f 18 26 p comp…
## 7 audi a4 3.1 2008 6 auto(a… f 18 27 p comp…
## 8 audi a4 quat… 1.8 1999 4 manual… 4 18 26 p comp…
## 9 audi a4 quat… 1.8 1999 4 auto(l… 4 16 25 p comp…
## 10 audi a4 quat… 2 2008 4 manual… 4 20 28 p comp…
## # … with 224 more rows
The columns of the data set are:
manufacturer: manufacturer name
model: model name
displ: engine displacement, in liters
year: year of manufacture
cyl: number of cylinders
trans: type of transmission
drv: the type of drive train, where f = front-wheel drive, r = rear wheel drive, 4 = 4wd
cty: city miles per gallon
hwy: highway miles per gallon (a car’s fuel efficiency on the highway)
fl: fuel type
class: “type” of car
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy)) +
labs(title = "Fuel Efficiency based on Engine Size",x= "Engine Size",y= "Fuel Efficiency")
We can see that there is a negative relationship between engine size and fuel efficiency. Which means cars with big engines consumes more fuel.
ggplot(data = mpg)
It shows an empty plot.
nrow(mtcars)
## [1] 32
ncol(mtcars)
## [1] 11
mtcars dataset contains 32 rows and 11 columns.
help(mpg)
drv variable describes the type of drive train, where f = front-wheel drive, r = rear wheel drive, 4 = 4wd.
ggplot(data = mpg) +
geom_point(mapping = aes(x = hwy, y = cyl))
ggplot(data = mpg) +
geom_point(mapping = aes(x = class, y = drv))
We can not define a relationship.
Aesthetic mappings describe how variables in the data are mapped to visual properties (aesthetics) of geoms. We can change point’s size, shape and color.
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy, color = class))
This plot maps colors of the points to classes. we can see which color corresponds to which class on the legend. For example, pink points represents the class suv.
We can map classes to size aesthetic. But mapping an unordered variable (class) to an ordered aesthetic (size) is not recommended.
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy, size = class))
## Warning: Using size for a discrete variable is not advised.
We can also map class to the alpha aesthetic, which helps us to change the transparency or the shape of the points.
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy, alpha = class))
## Warning: Using alpha for a discrete variable is not advised.
The shape can group maximum 6 classes, so the additional groups like suv class will not be plotted.
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy, shape = class))
## Warning: The shape palette can deal with a maximum of 6 discrete values because
## more than 6 becomes difficult to discriminate; you have 7. Consider
## specifying shapes manually if you must have them.
## Warning: Removed 62 rows containing missing values (geom_point).
With color parameter, we can change the colors of the points.
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy), color = "blue")
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy, color = "blue"))
The color does not convey information about a variable, it changes the appearance of the plot. We should write color outside of the aes function.
(Hint: type ?mpg to read the documentation for the dataset.) How can you see this information when you run mpg?
summary(mpg)
## manufacturer model displ year
## Length:234 Length:234 Min. :1.600 Min. :1999
## Class :character Class :character 1st Qu.:2.400 1st Qu.:1999
## Mode :character Mode :character Median :3.300 Median :2004
## Mean :3.472 Mean :2004
## 3rd Qu.:4.600 3rd Qu.:2008
## Max. :7.000 Max. :2008
## cyl trans drv cty
## Min. :4.000 Length:234 Length:234 Min. : 9.00
## 1st Qu.:4.000 Class :character Class :character 1st Qu.:14.00
## Median :6.000 Mode :character Mode :character Median :17.00
## Mean :5.889 Mean :16.86
## 3rd Qu.:8.000 3rd Qu.:19.00
## Max. :8.000 Max. :35.00
## hwy fl class
## Min. :12.00 Length:234 Length:234
## 1st Qu.:18.00 Class :character Class :character
## Median :24.00 Mode :character Mode :character
## Mean :23.44
## 3rd Qu.:27.00
## Max. :44.00
6 variables are categorical which are manufacturer, model, trans, drv, fl, class.
For categorical variables, we can split our plot into facets, which are subplots that each display one subset of the data.
facet_wrap() function facets the plot by a single variable.
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy)) +
facet_wrap(~ class, nrow = 2)
facet_grid() facets your plot on the combination of two variables.
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy)) +
facet_grid(drv ~ cyl)
To not facet in the rows or columns dimension, use a . instead of a variable name.
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy)) +
facet_grid(. ~ cyl)
A geom is the geometrical object that a plot uses to represent data. To change the geom in your plot, change the geom function that you add to ggplot().
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy))
ggplot(data = mpg) +
geom_smooth(mapping = (aes(x = displ, y = hwy)))
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
The below plot shows that geom_smooth() separates the cars into three lines based on their drv value, which describes a car’s drivetrain.
ggplot(data = mpg) +
geom_smooth(mapping = (aes(x = displ, y = hwy, linetype = drv)))
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
ggplot(data = mpg) +
geom_smooth(mapping = aes(x = displ, y = hwy, group = drv))
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
ggplot(data = mpg) + geom_smooth(mapping = aes(x = displ, y = hwy, color = drv), show.legend = FALSE )
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
To display multiple geoms in the same plot, we should add multiple geom functions.
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy)) +
geom_smooth(mapping = aes(x = displ, y = hwy))
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
But the code above, includes some duplication. When you want to change y-axis you should change it in both of the geom_functions. To avoid this we can pass set of mappings to ggplot(). The output plot will be same with the previous one.
ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
geom_point() +
geom_smooth()
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
If we add mappings to geom functions, ggplot2 will use these mappings to extend or overwrite the global mappings.
ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
geom_point(mapping = aes(color = class)) +
geom_smooth()
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
You can use same thing to specify different data for each layer.
ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
geom_point(mapping = aes(color = class)) +
geom_smooth( data = filter(mpg, class == "subcompact"), se = FALSE )
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
summary(diamonds)
## carat cut color clarity depth
## Min. :0.2000 Fair : 1610 D: 6775 SI1 :13065 Min. :43.00
## 1st Qu.:0.4000 Good : 4906 E: 9797 VS2 :12258 1st Qu.:61.00
## Median :0.7000 Very Good:12082 F: 9542 SI2 : 9194 Median :61.80
## Mean :0.7979 Premium :13791 G:11292 VS1 : 8171 Mean :61.75
## 3rd Qu.:1.0400 Ideal :21551 H: 8304 VVS2 : 5066 3rd Qu.:62.50
## Max. :5.0100 I: 5422 VVS1 : 3655 Max. :79.00
## J: 2808 (Other): 2531
## table price x y
## Min. :43.00 Min. : 326 Min. : 0.000 Min. : 0.000
## 1st Qu.:56.00 1st Qu.: 950 1st Qu.: 4.710 1st Qu.: 4.720
## Median :57.00 Median : 2401 Median : 5.700 Median : 5.710
## Mean :57.46 Mean : 3933 Mean : 5.731 Mean : 5.735
## 3rd Qu.:59.00 3rd Qu.: 5324 3rd Qu.: 6.540 3rd Qu.: 6.540
## Max. :95.00 Max. :18823 Max. :10.740 Max. :58.900
##
## z
## Min. : 0.000
## 1st Qu.: 2.910
## Median : 3.530
## Mean : 3.539
## 3rd Qu.: 4.040
## Max. :31.800
##
ggplot(data = diamonds) +
geom_bar(mapping = aes(x = cut))
This plot also shows count variable which is not variable in diamonds. Many graphs, like scatterplots, plot the raw values of your dataset. Other graphs, like bar charts, calculate new values to plot.
The algorithm used to calculate new values for a graph is called a stat, short for statistical transformation.
We can plot the same plot as the previous one with stat_count().
ggplot(data = diamonds) +
stat_count(mapping = aes(x = cut))
This works because every geom has a default stat, and every stat has a default geom.
To display a bar chart of proportion, rather than count.
ggplot(data = diamonds) +
geom_bar(mapping = aes(x = cut, y = ..prop.., group = 1) )
stat_summary() summarizes the y values for each unique x value, to draw attention to the summary that is being computed.
ggplot(data = diamonds) +
stat_summary(mapping = aes(x = cut, y = depth), fun.min = min, fun.max = max, fun = median)
You can color a bar chart using either the color aesthetic, or more usefully, fill.
ggplot(data = diamonds) +
geom_bar(mapping = aes(x = cut, color = cut))
ggplot(data = diamonds) +
geom_bar(mapping = aes(x = cut, fill = cut))
If you map the fill aesthetic to another variable, like clarity: the bars are automatically stacked. Each colored rectangle represents a combination of cut and clarity.
ggplot(data = diamonds) +
geom_bar(mapping = aes(x = cut, fill = clarity))
The stacking is performed automatically by the position adjustment specified by the position argument. If you don’t want a stacked bar chart, you can use one of three other options: “identity”, “dodge” or “fill”.
ggplot(data = diamonds, mapping = aes(x = cut, fill = clarity)) +
geom_bar(alpha = 1/5, position = "identity")
ggplot(data = diamonds, mapping = aes(x = cut, color = clarity)) +
geom_bar(fill = NA, position = "identity")
ggplot(data = diamonds) +
geom_bar(mapping = aes(x = cut, fill = clarity), position = "fill")
ggplot(data = diamonds) +
geom_bar(mapping = aes(x = cut, fill = clarity), position = "dodge")
There’s one other type of adjustment that’s not useful for bar charts, but it can be very useful for scatterplots. On one of the previous plot was displaying only 126 points, even though there were 234 observations in the dataset. It is because the values of hwy and displ are rounded so the points appear on a grid and many points overlap each other. This problem is known as overplotting. You can avoid this gridding by setting the position adjustment to “jitter.” position = “jitter” adds a small amount of random noise to each point. This spreads the points out because no two points are likely to receive the same amount of random noise.
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy))
ggplot(data = mpg) + geom_point(mapping = aes(x = displ, y = hwy), position = "jitter")
coord_flip() switches the x- and y-axes.
ggplot(data = mpg, mapping = aes(x = class, y = hwy)) +
geom_boxplot()
ggplot(data = mpg, mapping = aes(x = class, y = hwy)) +
geom_boxplot() +
coord_flip()
coord_quickmap() sets the aspect ratio correctly for maps.
nz <- map_data("nz")
ggplot(nz, aes(long, lat, group = group)) +
geom_polygon(fill = "white", color = "black")
ggplot(nz, aes(long, lat, group = group)) +
geom_polygon(fill = "white", color = "black") +
coord_quickmap()
coord_polar() uses polar coordinates. Polar coordinates reveal an interesting connection between a bar chart and a Coxcomb chart.
bar <- ggplot(data = diamonds) +
geom_bar(mapping = aes(x = cut, fill = cut), show.legend = FALSE, width = 1) +
theme(aspect.ratio = 1) +
labs(x = NULL, y = NULL)
bar + coord_flip()
bar + coord_polar()